Abstract

This report describes a two-level parallelization of a Computational Fluid Dynamics
(CFD) solver with multi-zone overset structured grids. The approach is based on a
hybrid MPI+OpenMP programming model suitable for shared memory machines and clusters
of shared memory machines. A performance investigation of the hybrid application
on an SGI Origin2000 (O2K) machine is reported using medium and large scale test
problems.

1 Introduction 

The advent of multiprocessing hardware and software technologies has introduced new chal- 
lenges into high-performance computing (HPC). Various parallel programming paradigms 
have been developed on distributed memory (DM) and distributed-shared memory (DSM) 
systems. Some widely used models are programs based on message passing, such as MPI 
[20], suitable both on DM and DSM architectures. Other popular paradigms are based on 
parallel compiler directives, such as OpenMP [11] and HPF [5], where the former exploits 
shared memory parallelism, and the latter takes advantage of data parallelism. Most of the 
parallel application codes based on the above approaches use only a single level of parallelism.
With the current trend in HPC hardware towards clusters of shared-memory symmetric
multi-processor (SMP) compute nodes, hybrid programming techniques have been intro- 
duced that exploit parallelism beyond a single level. The main thrust of these methods is to 
combine coarse and fine grain parallelism; this is obtained by a domain decomposition and 
loop-level parallelism, respectively. Communication of data across SMP clusters is achieved 
via message passing or shared memory referencing, and loop level parallelization by compiler 
directives. Hybrid approaches have also been applied to various other scientific disciplines, 
(see e.g. [15] and [8] for hybrid MPI+OpenMP approaches). Our main effort here is to 
identify multi-level parallelism in a given sequential code, and to incorporate the MPI and 
OpenMP programming models and interfaces into that code. This effort is substantially less 
programming intensive for applications whose computational regions already enjoy a natural 
domain decomposition structure. 

A hybrid methodology, named “Shared Memory Multi-Level Parallelism” (MLP) [19] and 
developed at NASA Ames, uses a fundamentally different approach. MLP exploits shared 
memory for all data communication via direct memory referencing instructions. MLP has 
been incorporated into two of NASA’s multi-zonal CFD solvers and into a climate modeling 
code. MLP codes have been reported to perform very efficiently on
NASA’s single-system-image SGI machine with 512 R12K CPUs. Programming with MLP
is significantly simpler than with hybrid MPI+OpenMP, in that no extensive
message passing library functions are used. Instead, the MLP library, also developed
at NASA Ames, consists of a few routines based on UNIX calls. A comparison of
OVERFLOW performance using MLP versus the MPI-based code can be found in
references [4] and [3]. MLP is currently limited to shared memory architectures.

This report describes the implementation of a hybrid MPI+OpenMP programming model 
into NASA’s high fidelity overset-grid CFD solver, OVERFLOW-D [9, 10]. The objective 
is to study the scalability of various parallel schemes implemented in this code on
different architectures, namely shared memory and SMP cluster systems such as the O2K
and IBM SP. OVERFLOW-D is based on a version of the aerodynamic flow solver OVERFLOW [12],
which is designed for complex configurations with static grid systems. OVERFLOW-D, in
contrast, was primarily developed for moving-body (dynamic) grid systems.

Implementation of our hybrid approach into either the dynamic or the static version of 
the code would have been similar, but we chose the dynamic version for reasons of avail- 
ability on several CFD problems of interest at the start of this work. We want to study 
the performance of our application code as it is applied to large scale test cases, which are 
more representative of practical problems. Such problems consist of large scale grid systems. 
Generation of large scale grids for the static application would have been very costly and 
time consuming. OVERFLOW-D has the capability of automatically generating its own
off-body (background) grids that are compatible with near-body grids.

The remainder of this report includes a brief overview of the numerical method, §2, and 
the hybrid parallelization strategy, §3. Some performance results are presented in §4. The 
test problem is related to the CFD simulation of complex vortex dynamics. 

The results demonstrate the functionality of the hybrid algorithm implemented. The 
performance results, however, are preliminary and reflect our initial experiences with the 
new algorithm on an SGI O2K machine. A discussion of future work, a summary and a
conclusion are given in §5. 


2 Numerical Method 

In this section we will briefly review the basics of the flow solver and domain connectivity 
on structured overset grids. 


2.1 Flow Solver 


The multi-block, or “multi-zone”, overset CFD code OVERFLOW [12] is popular for
high-fidelity simulation of complex aerodynamic shapes consisting of multiple geometric
components, in which individual blocks of body-fitting overlapping grids can be constructed
easily about each component. Grid blocks are either attached (near-body) or detached
(off-body). The latter is also called a background or a wake grid. The union of near-body
and off-body blocks covers the entire computational domain; this is known as a “Chimera” [18]
style decomposition, which falls into the general category of Schwarz domain decomposition.
The code solves the Reynolds-averaged Navier-Stokes equations, augmented with a number of
turbulence models. The dynamic code, OVERFLOW-D, simplifies the modeling of bodies in
relative motion. For example, in typical rotary-wing problems, the near-field is modeled
with one or more grids generated about the moving rotor blades. The code automatically
generates Cartesian wake grids, called bricks, that encompass the curvilinear near-body
grids. At each time iteration, flowfield equations are solved independently on each grid
zone in a sequential manner. Overlap boundary (intergrid) data is updated from previous
solutions prior to the current time step. Updates are furnished by a Chimera interpolation
procedure.

The code uses finite differences in space, and implicit time-stepping on structured grids. 
A variety of time-stepping and spatial differencing schemes are available. For steady-state 
problems, faster convergence is achieved by a spatially varying virtual time increment, which 
is chosen based on a CFL (Courant-Friedrichs-Lewy) number. The code offers a number of 
user-specified solution algorithms, such as three-factored block tridiagonal, pentadiagonal,
and Lower-Upper Symmetric Gauss-Seidel (LU-SGS) [17] schemes. Upgrades [13] have been
incorporated into newer versions of the dynamic code to improve the temporal and spatial
accuracy of the solution scheme. They consist of higher order spatial differencing and the 
addition of a subiteration scheme at each time-step to reduce factorization errors. 


2.2 Overset-grid Connectivity 

A domain connectivity program is used to determine the intergrid boundary data (see Fig. 1),
which consists of “holes” and outer boundary points on each grid. Holes are cut in grids
which intersect solid surfaces, such as when a portion of an overset grid lies inside a
physical body. Adjacent grids are expected to have a one-cell overlap to ensure continuity;
for higher order accuracy, a two-cell overlap is sought [13]. In the static version, the
coordinates of interpolated data are read from a pre-processed data file prior to the start
of the time-step loop. In the dynamic version, because of the relative motion of the grids,
a Domain Connectivity Function (DCF) [16] program is used instead to compute the intergrid
donor points that will be supplied to other grids, creating “holes” as needed. The
DCF procedure is fully coupled with the flow solver.


3 Hybrid Programming Model 

In the following subsections we describe a simple, flexible hybrid parallel design which
implements a combined MPI+OpenMP model, a two-level parallelism, in the OVERFLOW-D
application code. The code is typical of multi-zonal CFD codes; the approach discussed
here can be implemented in virtually all such CFD or other scientific applications. The
OVERFLOW-D solver has already been parallelized [21] via an MPI distributed style
algorithm, adopting the SPMD (Single Program Multiple Data) style, and is suitable for both
DM and DSM platforms.

Efforts for development of the current hybrid approach primarily consist of integrating
and interfacing the OpenMP programming model into the OVERFLOW-D code. The task is
straightforward in principle, and involves a focused analysis to identify the loops in the
application code that are suitable for OpenMP. For large codes, such as our application,
manual analysis to select such loops is prohibitively time-consuming and error prone. At the
time of this work, a recently developed NASA compiler-based automatic parallelization tool,
CAPO [7], became available. In this work we have made use of this tool, along with some
manual effort, to parallelize some of the loops in OVERFLOW-D.

The combined implementation permits the execution of the OVERFLOW-D code in 
pure message passing or, as in the true “hybrid” model, with multiple threads per MPI 
process. The former extends the code capability to run on clusters of SMP. In the following
subsections §3.1 and §3.2, we briefly review the MPI and OpenMP programming models of
the solver part of the OVERFLOW-D code.

3.1 MPI Implementation 

The MPI implementation has been developed specifically around the sequential version of
OVERFLOW-D and around the multi-block feature of the code, which offers a natural coarse
grain parallelism. The main computational logic at the top level of the sequential code
consists of a “time-loop”, a “grid-loop”, and a “subiteration loop”. The last two loops are
nested inside the time-loop. Solutions are obtained on the individual grids with imposed
boundary conditions, where the Chimera intergrid boundaries are updated successively at
the completion of the solution on each grid in the sequence. Upon completion of the
grid-loop, the solutions are automatically advanced to the next time-step. The overall
procedure may be thought of as a “Gauss-Seidel” iteration.

To facilitate parallel execution, a strategy is required to cluster grid components into
groups. Each group may contain several grid zones. The grouping is based on a bin-packing
strategy that seeks to maintain a roughly equal total number of grid points per group, and
at the same time retain a degree of connectivity among the grids within a group. This issue
is tied to load balancing among parallel processors and will be addressed in another work.
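
The grid-point balancing part of such a grouping can be sketched as a greedy, largest-first bin packing. This is only an illustration under stated assumptions: the function names and the input list are hypothetical, and the sketch ignores the connectivity criterion mentioned above, so it should not be read as the grouping algorithm used in OVERFLOW-D.

```python
import heapq


def group_zones(zone_points, n_groups):
    """Greedy largest-first packing of grid zones into groups (illustrative only).

    zone_points: grid-point count of each zone (hypothetical input).
    Returns a list of groups, each a list of zone indices.
    """
    if n_groups < 1 or n_groups > len(zone_points):
        raise ValueError("need 1 <= n_groups <= number of zones")
    # Min-heap of (grid points assigned so far, group index).
    heap = [(0, g) for g in range(n_groups)]
    heapq.heapify(heap)
    groups = [[] for _ in range(n_groups)]
    # Visit zones from largest to smallest; give each to the lightest group.
    order = sorted(range(len(zone_points)),
                   key=lambda z: zone_points[z], reverse=True)
    for z in order:
        load, g = heapq.heappop(heap)
        groups[g].append(z)
        heapq.heappush(heap, (load + zone_points[z], g))
    return groups


def group_loads(zone_points, groups):
    """Total number of grid points held by each group."""
    return [sum(zone_points[z] for z in grp) for grp in groups]
```

A production grouping would additionally weigh how strongly zones overlap, so that zones exchanging much Chimera data tend to land in the same group.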

The main change in the logic of the sequential code is to subdivide the grid-loop into
two loops, a loop over groups and a loop over the grids within each group. The outer
“group-loop” is done in parallel and contains the grid-loop. One MPI process is assigned to
each group, with the total number of groups, $N_{\text{MPI-proc}}$, equal to the total number
of MPI processes invoked at the execution of the code. The intergrid update among grids
within each group, named intra-group, is done similarly to the serial case. Chimera updates
are also necessary for overlapping grids across group boundaries, known as inter-group data
exchanges (see Fig. 1). The supply of inter-group donor points from grids in, say, the n-th
group to grids of the m-th group, $m = 1, \ldots, N_{\text{MPI-proc}}$, is stored in a send
array that will be exchanged by MPI calls. The inter-group exchanges are done at the
beginning of the current time-step based on the interpolation data of the previous one.
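
The restructured loop nest can be outlined as follows. This is a serial sketch with hypothetical callback names (`solve_zone`, `exchange`); in the actual code the group-loop is executed concurrently, one MPI process per group, whereas here the groups are simply visited in turn to show the ordering of the steps.

```python
def run_time_steps(groups, n_steps, solve_zone, exchange):
    """Serial outline of the time-loop / group-loop / grid-loop structure.

    groups:     list of lists of zone ids, one list per (simulated) MPI process
    solve_zone: hypothetical callback, solve_zone(zone, step)
    exchange:   hypothetical callback, exchange(groups, step), standing in for
                the inter-group Chimera boundary exchange
    """
    for step in range(n_steps):        # time-loop
        # Inter-group data from the previous step is exchanged first.
        exchange(groups, step)
        for zones in groups:           # group-loop (parallel over MPI in practice)
            for zone in zones:         # grid-loop (sequential within a group)
                solve_zone(zone, step)
```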

In the hybrid mode, with multiple threads assigned for each MPI process, only the master
thread is responsible for data exchanges as discussed below. MPI processes are synchronized
at the completion of the solution over each group. In addition, for problems with moving
grids, the DCF program must be invoked to compute new donors prior to the following
time-step.

Figure 1: Overset grid connectivity, grouping, interpolation, and data exchanges.

3.2 OpenMP Implementation 

The MPI-based scheme only accounts for one level of parallelism; a second level is exploited
by fine-grain parallelism using OpenMP compiler directives. Compilers on certain machines,
such as the O2K, have automatic parallelization options, but the loops chosen tend to be the
very innermost loops. This tendency reduces the efficiency of multi-threading. The OpenMP
[11] approach offers a superior alternative. The main effort of the OpenMP implementation is
first to identify parallel loops and parallel regions in the code, and then to insert OpenMP
directives along with the proper lists of privatizable and shared variables.

Manual analysis would be tedious and very time-consuming for large programs such
as OVERFLOW-D, which consists of over 100,000 lines of FORTRAN and approximately
1000 subroutines. We have employed the automatic parallelization tool CAPO [7] to detect
parallelism and insert directives in the code. CAPO is based on the CAPTools [6] toolkit
and makes full use of CAPTools’ unprecedented interprocedural analysis for determining data
dependence. CAPO allows a single level of OpenMP for multiple nested loops. Directives
such as !$OMP PARALLEL DO are most effective if they can be inserted at
the outermost loops, to ensure large granularity and small overhead.

The hybrid code structure at the top is similar to the MPI code (see §3.1), consisting of
the time-loop, group-loop and grid-loop. The group-loop runs in parallel via MPI, but the
grid-loop, which contains the computationally intensive part of the code, is multi-threaded
by OpenMP. Currently, an equal number of threads, $N_{thrd}$, is spawned for each MPI
process. In the hybrid mode, for a total of $N_{cpu}$ processors, MPI spawns
$N_{\text{MPI-proc}}$ processes at the initial level of the main program, each followed by
$N_{thrd}$ OpenMP threads, subject to the constraint
$N_{cpu} = N_{\text{MPI-proc}} \times N_{thrd}$. The OpenMP thread initialization
follows a “fork/join” model. One of the $N_{thrd}$ threads acts as the master thread and
the others as team members. The master in each MPI process acts as the MPI processor.
In the absence of a parallel construct, the master thread executes in serial mode while the
other threads remain idle. The inter-group data exchanges across MPI processes are done
only by the master thread within each MPI process.

Figure 2: Schematic of the hybrid MPI+OpenMP implementation; master-thread MPI
communication and parallel OpenMP computation.

In the hybrid implementation, a sequential communication strategy is adopted. The
packing and unpacking of messages are done in a thread-parallel fashion, but only the master
thread sends and receives the messages using MPI calls. There is no inter-group cross
communication among the threads. This strategy may be a source of sequential bottlenecks,
depending upon the communication load. In addition to the MPI synchronization calls,
OpenMP threads must be synchronized at the end of every parallel construct. Fig. 2
illustrates the hybrid MPI+OpenMP implementation for two MPI processes
and four OpenMP threads in OVERFLOW-D. The master threads within each MPI process
exchange the inter-group data.

Load balancing in the hybrid code is static. It is furnished by the grouping strategy, with
the same characteristics as in the MPI code (§3.1), at the start of the first time-step. A
dynamic load balancing strategy (not implemented in this work) could be sought by varying
$N_{thrd}$ in proportion to the workload on the MPI processes. The workload can be evaluated
from the computation time per group for the first few time-steps at runtime.
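
One way such a proportional thread assignment might look is sketched below. Since the report does not implement dynamic balancing, everything here is an assumption made for illustration: the function name, the rule that each MPI process keeps at least one thread, and the largest-remainder rounding.

```python
def allocate_threads(group_times, n_cpu):
    """Split n_cpu threads over MPI processes in proportion to measured time.

    group_times: measured computation time per group (hypothetical input).
    Every process keeps at least one thread; the remaining threads are shared
    out proportionally using largest-remainder rounding.
    """
    n_proc = len(group_times)
    if n_cpu < n_proc:
        raise ValueError("need at least one CPU per MPI process")
    total = sum(group_times)
    spare = n_cpu - n_proc
    shares = [t / total * spare for t in group_times]
    threads = [1 + int(s) for s in shares]
    leftover = n_cpu - sum(threads)
    # Hand the last few threads to the largest fractional remainders.
    by_remainder = sorted(range(n_proc),
                          key=lambda i: shares[i] - int(shares[i]),
                          reverse=True)
    for i in by_remainder[:leftover]:
        threads[i] += 1
    return threads
```

For example, three groups with measured times 4.0, 2.0 and 2.0 seconds on 8 CPUs would receive 4, 2 and 2 threads under this rule.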

The OpenMP constructs in OVERFLOW-D were generated through a judicious use of CAPO along
with some manual analysis and modification. The hybrid code consists of over 1000 OpenMP
parallel constructs, whose coding is spread out over nearly 400 subroutines. Some parallel
constructs within the DCF part of the code had to be modified and/or removed for efficiency
and/or debugging. The DCF portion of the code essentially runs in serial mode. Several
parallel constructs in subroutines pertinent to the flow solver have been modified; for
instance, certain data dependencies were removed in some “parallel do loop” constructs. In
addition, parallel constructs in approximately 80 subroutines were found to be interfering
with the memory allocation/freeing procedure. This occasionally caused the program to hang
when run on O2K machines with $N_{thrd} > 2$. All these subroutines were modified, leading
to stable hybrid execution.

Figure 3: Illustration of a pipeline parallelization method for LU-SGS.

One of the code sections that involved manual modifications is the LU-SGS linear solver.
The LU-SGS scheme [17] combines the advantages of LU factorization and Gauss-Seidel
relaxation to improve the numerical convergence rate. But the inherent data dependences
in the scheme require the availability of the solution on the previous diagonal line for each
diagonal line in the solution process. The “hyper-line” algorithm, similar to the
“hyper-plane” algorithm [2], was used in the original code to achieve reasonable parallel
performance on vector machines. However, for a cache-based machine like the O2K there are
two main limitations of the algorithm: poor cache utilization and small communication
granularity. In fact, our first version of the OpenMP LU-SGS code generated by CAPO performed
very poorly, achieving a speedup of 1.2 on 4 CPUs for a small test case. The poor performance
was a direct consequence of the original code structure, for which CAPO was only able to
insert OpenMP directives in some of the inner loops.

A better approach to parallelize the LU-SGS scheme is the pipeline algorithm described 
in [22]. To illustrate the pipeline method, Fig. 3 shows a case of a 1-D pipeline in which the
data grid is partitioned in the K dimension among four threads (or processors). Thread 0
starts from the lower-left corner and works on one slice of data for the first L value. Other
threads are waiting for data to be available. Once thread 0 finishes its job, thread 1 can start 
working on its slice for the same L and, in the meantime, thread 0 moves onto the next L. 
This process continues until all the threads become active. Then they all work concurrently 
to the opposite end, as indicated by the large arrow in the figure. The pipeline algorithm has 
better cache performance and less communication cost than the hyper-plane algorithm. We 
restructured the LU-SGS code in a way that CAPO could automatically exploit pipelines 
with OpenMP directives. The new parallel version of LU-SGS not only improved the parallel 
performance, with a speedup of 2.9 on 4 CPUs for the same test case mentioned earlier, but 
also the sequential performance. 
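
The start-up and steady state of the 1-D pipeline described above can be made concrete with a small scheduling sketch. It only models which thread works on which L value at each stage; the function name and the stage numbering are illustrative assumptions, not part of the OVERFLOW-D code.

```python
def pipeline_schedule(n_threads, n_l):
    """For each stage, list the (thread, l) pairs that are active in that stage.

    Thread t can process its slice for index l only after thread t-1 has
    finished the same l, so thread t handles index l during stage t + l.
    """
    n_stages = n_threads + n_l - 1
    stages = [[] for _ in range(n_stages)]
    for t in range(n_threads):
        for l in range(n_l):
            stages[t + l].append((t, l))
    return stages
```

With four threads, only thread 0 is busy in stage 0, two threads are busy in stage 1, and all four work concurrently from stage 3 onward until the pipeline drains at the end.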


4 Performance Results 

Several experiments have been conducted to demonstrate the functionality and accuracy 
of the hybrid OVERFLOW-D code, using performance results from two main test cases. 
The physics of one of the test problems is related to the Navier-Stokes simulation of vortex 
dynamics in a complex vortical wake flow of hovering rotors, whose solutions were extensively 
studied in reference [14]. All the performance computations have been obtained on an SGI
O2K shared memory machine equipped with 512 MIPS R12K processors with a clock frequency of
400 MHz.

The test cases consist of a medium and a large overset grid system, as follows:

• CASE 1 grid system consists of 41 grid zones, with a total of approximately 8 million
grid points.

• CASE 2 grid system consists of 857 grid zones, with a total of approximately 68 million
grid points. To the best of our knowledge, this test case is the largest MPI/OpenMP
application currently being evaluated on the O2K platform.

Solutions are obtained over many time-steps to confirm the stability of the hybrid code.
Residuals of the solutions are compared with those of the MPI code alone for accuracy. For
the performance experiments, however, computations are only carried out over 100 time-steps,
and the timing mainly reflects flow computations. The DCF computation is only invoked at the
initial time-step and its impact over 100 time-steps is minimal.

Runtimes (in seconds), denoted by $T_{exe}$, are normalized per time-step and reflect the
sum of computation and communication times. Furthermore, timings are averaged across the
$N_{\text{MPI-proc}}$ processes used in each run. The “speedup” and “efficiency” reported
below are defined as the ratio of the “base-runtime” to the runtime on $N_{cpu}$ processors,
and as the percentage of the actual speedup relative to the theoretical maximum speedup,
respectively. The base-runtime for each test case is taken to be the timing $T_{exe}$ for
the smallest $N_{cpu}$.
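
As a small worked example of these two definitions, the sketch below recomputes two entries of Table 1 (CASE 1, base run on 4 CPUs with a runtime of 24.6 s). The function names are illustrative, and the theoretical maximum speedup is assumed here to be the ratio of processor counts, which is consistent with the 100% efficiency assigned to the base run.

```python
def speedup(t_base, t_run):
    """Ratio of the base runtime to the runtime of the run being assessed."""
    return t_base / t_run


def efficiency(t_base, n_base, t_run, n_cpu):
    """Efficiency in percent: actual speedup over the assumed ideal n_cpu / n_base."""
    return 100.0 * speedup(t_base, t_run) / (n_cpu / n_base)


# CASE 1 values taken from Table 1: base run is 4 CPUs at 24.6 s per step.
s8 = speedup(24.6, 14.2)                  # 8 CPUs
e8 = efficiency(24.6, 4, 14.2, 8)
s128 = speedup(24.6, 2.6)                 # 128 CPUs
e128 = efficiency(24.6, 4, 2.6, 128)
print(round(s8, 2), round(e8), round(s128, 2), round(e128))
```

Rounded, these reproduce the tabulated speedups of 1.73 and 9.46 and efficiencies of 87% and 30%.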

The number of combinations of $N_{\text{MPI-proc}} \times N_{thrd}$ that yield the same
$N_{cpu}$ can be limited by the number of groups, which is the same as
$N_{\text{MPI-proc}}$. This cannot be larger than the total number of grid zones in the
problem; otherwise some group would be empty. In addition, for reasons of load balancing, it
may be necessary to keep the number of groups less than the number of grid zones.
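
The admissible combinations for a given processor count can be enumerated directly. The helper below is a hypothetical illustration of the constraint just described (the product must equal the CPU count and the number of MPI processes may not exceed the number of zones); it does not model the additional load-balancing considerations.

```python
def hybrid_combinations(n_cpu, n_zones):
    """List (n_mpi_proc, n_thrd) pairs with n_mpi_proc * n_thrd == n_cpu.

    The number of MPI processes (one per group) is capped by the number of
    grid zones, since a group must contain at least one zone.
    """
    limit = min(n_cpu, n_zones)
    return [(p, n_cpu // p) for p in range(1, limit + 1) if n_cpu % p == 0]
```

For CASE 1, with 41 zones and 64 CPUs, this yields (1, 64), (2, 32), (4, 16), (8, 8), (16, 4) and (32, 2); a pure MPI run with 64 processes is excluded because it would leave some groups empty.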

4.1 CASE 1 

Table 1 shows performance results for the hybrid implementation of OVERFLOW-D on the
O2K, using the CASE 1 test problem. Due to the application’s memory space requirement,
the least number of processors necessary to run the code is $N_{cpu} = 4$. Runtimes are
reported for some combinations of $N_{\text{MPI-proc}} \times N_{thrd}$; speedup and
parallel efficiency data are given for the best of these combinations.

Overall speedup increases while the parallel efficiency decreases. For $N_{cpu} > 128$ the
speedup is insignificant. The runtimes for combinations with the same value of
$N_{\text{MPI-proc}}$ but varying $N_{thrd}$ scale reasonably for $N_{thrd} \le 4$. For
instance, for $N_{\text{MPI-proc}} = 16$, the ratio of runtimes for $N_{thrd} = 1$ to
$N_{thrd} = 4$ is 2.5, which is ~62% of the ideal value, whereas for $N_{thrd} = 8$ it is
~33% of the ideal value. Comparison of runtimes within the $N_{cpu}$ rows of the table,
consisting of combinations of $N_{\text{MPI-proc}} \times N_{thrd}$, shows that efficiency
increases for $N_{\text{MPI-proc}} > 8$. Furthermore, the best of the hybrid runs with
$N_{\text{MPI-proc}} > 8$ are in the same ballpark as runs with $N_{thrd} = 1$, for the
same $N_{cpu}$.
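
The thread-scaling figures quoted above can be checked against the Table 1 runtimes for 16 MPI processes (9.6 s with one thread, 3.8 s with four, 3.5 s with eight). The helper name is illustrative, and the ideal ratio is assumed to equal the number of threads.

```python
def thread_scaling(t_one_thread, t_n_threads, n_thrd):
    """Return (runtime ratio, percent of the ideal ratio n_thrd)."""
    ratio = t_one_thread / t_n_threads
    return ratio, 100.0 * ratio / n_thrd


# Table 1, 16 MPI processes: 1, 4 and 8 threads per process.
ratio4, pct4 = thread_scaling(9.6, 3.8, 4)
ratio8, pct8 = thread_scaling(9.6, 3.5, 8)
print(round(ratio4, 2), round(pct4), round(ratio8, 2), round(pct8))
```

The results agree with the quoted values to within rounding: a ratio of about 2.5 for four threads, and roughly a third of the ideal ratio for eight threads.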

A detailed analysis can be complex, for there are several parameters that have to be
assessed. One is load balancing among the MPI processes. As discussed earlier (see §3), the
current load balancing strategy is static. Load imbalance may stem from both computation
and communication. For a given run, an optimum value of $N_{\text{MPI-proc}}$ should be
sought. Computational load imbalance mainly depends upon the initial grouping strategy.
However, in the hybrid mode, computation may be further imbalanced by anomalies
which may exist between CPU assignment and memory placement. Memory should be
placed on the node on which the $N_{thrd}$ threads assigned to the pertinent MPI process
are running.

For combination runs with $N_{\text{MPI-proc}} = 16$ and $N_{thrd} = 1, 2, 4$, and 8, the
relative speedup is nonlinear and reverses from 4 to 8. The nonlinear behavior is also
verified based on the computational time (not shown in the table), the cause of which may
lead us to issues such as memory placement, cache line optimization, or parallelization
strategy. As discussed in §3.2, the main computational task in the hybrid code is performed
under OpenMP parallelism. A large portion of computational time is consumed by the linear
solution scheme, LU-SGS, in this case. The fraction of reduction of computational time, in
this case, is not small due to the fact that the LU-SGS solver lends itself poorly to
parallelization.

Table 1: Runtimes (in seconds) of the hybrid implementation on the O2K using test problem
CASE 1.

  $N_{cpu}$   $N_{\text{MPI-proc}}$   $N_{thrd}$   $T_{exe}$   Speedup   Efficiency (%)
  ---------   ---------------------   ----------   ---------   -------   --------------
       4                 4                 1          24.6       1.00         100
       8                 2                 4          17.2
       8                 8                 1          14.2       1.73          87
      16                 2                 8          14.6
      16                 4                 4          12.8
      16                 8                 2          10.5
      16                16                 1           9.6       2.65          64
      32                 4                 8          10.1
      32                 8                 4           6.4
      32                16                 2           5.9       4.17          52
      32                32                 1           5.9
      48                 8                 6           5.8
      48                16                 3           4.2
      64                 8                 8           4.0
      64                16                 4           3.8       5.86          49
      64                32                 2           3.8
     128                 8                16           5.8
     128                16                 8           3.5
     128                32                 4           2.6       9.46          30

Table 2 compares runtimes of the hybrid code with $N_{thrd} = 1$ to the corresponding
“MPI-alone” code. MPI-alone refers to computations made by the MPI code with no
OpenMP directives invoked. Due to the overhead associated with the hybrid code, its
runtime is expected to be 5 to 10% larger than for MPI-alone. For the most
part this is true for $N_{cpu} > 8$.

Table 2: Runtimes (in seconds) comparison of the hybrid with $N_{thrd} = 1$ and MPI-alone
for CASE 1.

  $N_{cpu}$   Hybrid $T_{exe}$   MPI-alone $T_{exe}$
  ---------   ----------------   -------------------
       4            24.6                31.4
       8            14.2                15.4
      16             9.6                 9.0
      32             5.9                 5.3
      41             5.9                 5.3

4.2 CASE 2

Table 3 shows performance results of the hybrid code for some combinations of MPI processes
and OpenMP threads. For the size of the problem in CASE 2, the base-runtime is obtained
with the smallest $N_{cpu} = 56$. Because of this, and since the maximum number of available
processors on the O2K machine is 498 CPUs, the number of combinations of
$N_{\text{MPI-proc}} \times N_{thrd}$ for this experiment is limited. Speedup increases
while the parallel efficiency decreases to 41% at $N_{cpu} = 448$. The hybrid implementation
with $N_{thrd} > 1$ outperforms the hybrid with $N_{thrd} = 1$ for the same $N_{cpu}$. Some
issues related to the hybrid runs with $N_{thrd} = 1$ as compared to MPI-alone are discussed
below. Combinations with $N_{\text{MPI-proc}} = 56$ and varying $N_{thrd}$ show nonlinear
speedup, with almost none between 4 and 8 threads. Again, as stated for CASE 1, this may
relate to issues of memory placement, etc.

The number of messages and volume of communication per MPI process is quite large 
for this test case. The communication times (not shown here) are about 30 to 40% of 
the total execution times. Threads introduce additional overhead related to creation and 
synchronization, which increases the overall communication time. 


Table 3: Runtimes (in seconds) of the hybrid implementation on the O2K using CASE 2.

  $N_{cpu}$   $N_{\text{MPI-proc}}$   $N_{thrd}$   $T_{exe}$   Speedup   Efficiency (%)
  ---------   ---------------------   ----------   ---------   -------   --------------
      56                56                 1          19.1       1.00         100
     112                56                 2          14.6
     112               112                 1          14.0       1.36          68
     224                56                 4           9.5
     224               112                 2           9.2
     224               224                 1           8.9       2.14          63
     336                56                 6           9.3
     336               112                 3           6.6       2.89          48
     336               336                 1           8.1
     448                56                 8           9.1
     448               112                 4           5.8       3.29          41
     448               224                 2           7.1
     448               448                 1           8.2

Table 4 compares the performance of the hybrid code with $N_{thrd} = 1$ to that of
MPI-alone, using CASE 2. As mentioned in §4.1, inclusion of OpenMP directives would be
expected to amount to a 5 to 10% increase in the runtime. However, the data in this table
show a significant increase of about 70% in runtimes. The exact cause of this anomaly is
not known to us at this time, but it may be an indication of some strong interaction between
MPI and OpenMP. It appears that the problem is magnified when large numbers and volumes of
messages are communicated across the processors. The O2K system utility software “ssrun”
has been used to profile both the hybrid and MPI-alone programs, using $N_{cpu} = 112$. An
extensive one-to-one comparison of captured performance data between the two codes shows
that timing on all MPI calls is almost doubled for the hybrid implementation. The hybrid
and MPI-alone codes are identical, except that in the compilation of the latter the OpenMP
“-mp” option is not invoked.
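
The size of this overhead can be quantified from the runtimes reported in Table 4. The sketch below uses an illustrative helper name and simply expresses the extra runtime of the single-thread hybrid run as a percentage of the MPI-alone runtime.

```python
def overhead_percent(t_hybrid, t_mpi_alone):
    """Extra runtime of the hybrid run relative to MPI-alone, in percent."""
    return 100.0 * (t_hybrid - t_mpi_alone) / t_mpi_alone


# (N_cpu, hybrid T_exe, MPI-alone T_exe) as reported in Table 4 for CASE 2.
table4 = [(56, 19.1, 15.6), (112, 14.0, 8.6), (224, 8.9, 6.0),
          (336, 8.1, 4.7), (448, 8.2, 4.9)]
for n_cpu, t_hyb, t_mpi in table4:
    print(n_cpu, round(overhead_percent(t_hyb, t_mpi)))
```

The overhead grows from roughly a fifth at 56 CPUs to well above one half for the larger runs, reaching about 70% at 336 CPUs.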

Table 4: Runtimes (in seconds) comparison of hybrid with $N_{thrd} = 1$ and MPI-alone using
CASE 2.

  $N_{cpu}$   Hybrid $T_{exe}$   MPI-alone $T_{exe}$
  ---------   ----------------   -------------------
      56            19.1                15.6
     112            14.0                 8.6
     224             8.9                 6.0
     336             8.1                 4.7
     448             8.2                 4.9

5 Conclusions and Future Work

The current parallel implementation of the multi-zonal CFD code, OVERFLOW-D, supports
MPI and a dual-level hybrid MPI+OpenMP strategy on shared memory machines and clusters
of SMPs. Results were only obtained on shared memory SGI O2K machines. The hybrid
approach delivers about the same performance as MPI-alone for the same total number
of processors. The hybrid implementation, however, can outperform MPI-alone when
the MPI-based code cannot run beyond a certain number of groups, being limited by the
number of grid zones and the related load balance. This conclusion is verified for CASE 1,
above.

With respect to CASE 2, more experiments are needed to understand the strong interaction
which occurs between OpenMP and MPI. This interaction affects the overall timing
on all MPI calls and is particularly noticeable when results of the hybrid code, running
with one thread, are compared with results of MPI-alone for the same number of MPI
processes. These results may be dependent upon the hardware and vendor implementation of
the parallel libraries.

Future work should focus on a more detailed analysis of the hybrid code scaling perfor- 
mance. Linear solution algorithms, other than LU-SGS, should be efficiently parallelized and 
tested. Certain OpenMP parallel constructs in the hybrid code need improvement that could 
not be addressed under the current time constraints. On the SGI O2K platform, an attempt
should further be made to investigate the effect of the placement of processes and memory,
as discussed in §4.1. The cause of OpenMP interference with MPI on the O2K machine,
discussed in §4.2, should also be verified. Cache optimization of fine level loops may
significantly enhance the scaling performance of the hybrid code. It is essential that a
dynamic load balancing technique be implemented into the hybrid code. A detailed analysis is
required to understand the impact of various parameters on the parallel performance. Further
future work should include porting and testing of the hybrid approach on other platforms,
such as IBM SP.